Search Result

Distributed fault localization framework for high performance computing

GAO Jian, YU Kang, QING Peng, WEI Hongmei

Journal of Computer Applications 2018, 38 (1): 44-49. DOI: 10.11772/j.issn.1001-9081.2017071948

Fault localization in high performance computing systems is difficult and often too slow for real-time use. To address this, a Message-Passing based Fault Localization (MPFL) framework was proposed, comprising the Tree-based Fault Detection (TFD) and Tree-based Fault Analysis (TFA) algorithms. First, when the parallel application is initialized, all nodes participating in the computation are logically partitioned into a Fault Localization Tree (FLT), and the fault localization tasks are distributed across different nodes. Second, when a system component such as the message-passing library or the operating system detects an abnormal node state, the TFD algorithm analyzes the FLT structure and selects the node responsible for receiving that abnormal state, based on factors such as load balancing and performance cost. Finally, the receiving node runs the TFA algorithm to infer the fault from the abnormal state. TFA combines rule-based event correlation with lightweight active probing based on message passing, and using the two approaches together improves the accuracy of fault analysis. The framework was evaluated on a typical cluster, where it located nodes whose shutdown had been simulated. Experimental results on the NPB-FT and NPB-IS benchmarks show that MPFL performs well in both fault localization capability and cost savings.
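The tree construction and receiver selection described above can be illustrated with a small sketch. The abstract does not specify the shape of the FLT or the exact selection rule, so everything here is an assumption: nodes are identified by integer ranks, the tree is a complete k-ary tree rooted at rank 0, and the receiver is simply the nearest ancestor not already known to have failed. The names `build_flt`, `children`, and `pick_receiver` are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Optional, Set


def build_flt(num_nodes: int, fanout: int = 2) -> Dict[int, Optional[int]]:
    """Map each rank to its parent in a complete k-ary tree rooted at rank 0.

    Assumption: the paper's FLT may be built differently; this is only one
    simple way to partition ranks into a tree.
    """
    if num_nodes < 1 or fanout < 1:
        raise ValueError("num_nodes and fanout must be positive")
    return {r: (None if r == 0 else (r - 1) // fanout) for r in range(num_nodes)}


def children(flt: Dict[int, Optional[int]], rank: int) -> List[int]:
    """Return the ranks whose parent is `rank`, in ascending order."""
    return sorted(r for r, parent in flt.items() if parent == rank)


def pick_receiver(
    flt: Dict[int, Optional[int]], suspect: int, failed: Set[int]
) -> Optional[int]:
    """Pick the node that receives the abnormal state of `suspect`.

    Hypothetical rule: walk up the tree to the nearest ancestor that is not
    in `failed`. Returns None if no such ancestor exists (for example, when
    the suspect is the root).
    """
    node = flt[suspect]
    while node is not None and node in failed:
        node = flt[node]
    return node


if __name__ == "__main__":
    flt = build_flt(7, fanout=2)
    print(children(flt, 1))              # ranks supervised by rank 1
    print(pick_receiver(flt, 3, {1}))    # rank 1 is down, so escalate upward
```

A real selection rule would also weigh load balancing and performance cost, as the abstract notes; the ancestor walk only shows how the tree confines each report to a small part of the system.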

Link fault monitoring in optical networks based on wavelet transform

XIONG Yu, LIU Xiaoqing, PENG Haiying, WANG Ruyan

Journal of Computer Applications 2013, 33 (02): 382-399. DOI: 10.3724/SP.J.1087.2013.00382

Traditional fault monitoring methods suffer from large deviations and slow response. To address these problems, a link fault monitoring algorithm based on the wavelet transform was presented. The algorithm uses a dynamic polling scheme to sample the optical power and exploits the time-frequency localization of the wavelet transform to extract fault information from the sampled values. The monitored optical power is decomposed at multiple scales to suppress the effects of noise, thereby improving the accuracy of fault monitoring. The experimental results show that, compared with analytical methods in the time domain, the proposed algorithm is more robust to noise: the missed alarm (leakage alarm) rate is reduced to zero and the false alarm rate is reduced by five percentage points. The monitoring time is between 2.53 ms and 3.12 ms, which meets the real-time requirement.
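The idea of multi-scale decomposition for noise suppression can be sketched as follows. The abstract names neither the wavelet nor the decision rule, so this example assumes the Haar wavelet and a hypothetical rule that flags a fault when the smoothed power falls by more than a fixed threshold between consecutive blocks. The function names and the threshold value are illustrative only.

```python
import math
from typing import List, Optional, Tuple


def haar_step(x: List[float]) -> Tuple[List[float], List[float]]:
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    if len(x) == 0 or len(x) % 2:
        raise ValueError("length must be a positive even number")
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def smoothed_power(samples: List[float], levels: int) -> List[float]:
    """Discard the detail coefficients of `levels` Haar levels.

    The remaining approximation, rescaled by 2**(levels / 2), equals the mean
    power over each block of 2**levels samples, which averages out
    sample-to-sample noise.
    """
    approx = list(samples)
    for _ in range(levels):
        approx, _detail = haar_step(approx)
    scale = 2.0 ** (levels / 2.0)
    return [a / scale for a in approx]


def detect_drop(
    samples: List[float], levels: int, threshold: float
) -> Optional[int]:
    """Return the sample index where the first large power drop begins.

    Assumed rule (not from the paper): a fault is flagged when the smoothed
    power of a block is lower than that of the previous block by more than
    `threshold`. Returns None if no such drop is found.
    """
    blocks = smoothed_power(samples, levels)
    block_len = 2 ** levels
    for i in range(1, len(blocks)):
        if blocks[i - 1] - blocks[i] > threshold:
            return i * block_len
    return None


if __name__ == "__main__":
    power = [1.0, 1.1, 0.9, 1.0, 1.0, 0.9, 1.1, 1.0,
             0.2, 0.1, 0.2, 0.1, 0.1, 0.2, 0.1, 0.2]
    print(detect_drop(power, levels=2, threshold=0.5))
```

Using more levels smooths more aggressively but coarsens the time resolution of the reported index, which is the usual trade-off between noise rejection and localization in this kind of scheme.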